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Abstract 

Associating distinct groups of objects (clusters) with contiguous regions of high prob¬ 
ability density (high-density clusters), is central to many statistical and machine learning 
approaches to the classification of unlabelled data. We propose a novel hyperplane classifier 
for clustering and semi-supervised classification which is motivated by this objective. The 
proposed minimum density hyperplane minimises the integral of the empirical probabil¬ 
ity density function along it, thereby avoiding intersection with high density clusters. We 
show that the minimum density and the maximum margin hyperplanes are asymptotically 
equivalent, thus linking this approach to maximum margin clustering and semi-supervised 
support vector classifiers. We propose a projection pursuit formulation of the associated 
optimisation problem which allows us to find minimum density hyperplanes efficiently in 
practice, and evaluate its performance on a range of benchmark data sets. The proposed 
approach is found to be very competitive with state of the art methods for clustering and 
semi-supervised classification. 

Keywords: low-density separation, high-density clusters, clustering, semi-supervised 

classification, projection pursuit 

1. Introduction 

We study the fundamental learning problem; Given a random sample from an unknown 
probability distribution with no, or partial label information, identify a separating hyperplane 
that avoids splitting any of the distinct groups (clusters) present in the sample. We adopt 
the cluster definition given by Hartigan (1975, chap. 11), in which a high-density cluster 
is defined as a maximally connected component of the level set of the probability density 
function, p(x), at level c ^ 0, 


levcp(x) = |x G I p(x) > c| . 
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An important advantage of this approach over other methods is that it is well founded 
from a statistical perspective, in the sense that a well-defined population quantity is being 
estimated. 

However, since p(x) is typically unknown, detecting high-density clusters necessarily 
involves estimates of this function, and standard approaches to nonparametric density es¬ 
timation are reliable only in low dimensions. A number of existing density clustering al¬ 
gorithms approximate the level sets of the empirical density through a union of spheres 
around points whose estimated density exceeds a user-defined threshold (Walther, 1997; 
Cuevas et ah, 2000, 2001; Rinaldo and Wasserman, 2010). The choice of this threshold 
affects both the shape and number of detected clusters, while an appropriate threshold is 
typically not known in advance. The performance of these methods deteriorates sharply as 
dimensionality increases, unless the clusters are assumed to be clearly discernible (Rinaldo 
and Wasserman, 2010). An alternative is to consider the more specific problem of allocating 
observations to clusters, which shifts the focus to local properties of the density, rather than 
its global approximation. The central idea underlying such methods is that if a pair of ob¬ 
servations belong to the same cluster they must be connected through a path traversing only 
high-density regions. Graph theory is a natural choice to address this type of problem. Az- 
zalini and Torelli (2007); Stuetzle and Nugent (2010) and Menardi and Azzalini (2014) have 
recently proposed algorithms based on this approach. Even these approaches however are 
limited to problems of low dimensionality by the standards of current applications (Menardi 
and Azzalini, 2014). 

An equivalent formulation of the density clustering problem is to assume that clusters are 
separated through contiguous regions of low probability density; known as the low-density 
separation assumption. In both clustering and semi-supervised classification, identifying 
the hyperplane with the maximum margin is considered a direct implementation of the low- 
density separation approach. Motivated by the success of support vector machines (SVMs) 
in classification, maximum margin clustering (MMC) (Xu et ah, 2004), seeks the maximum 
margin hyperplane to perform a binary partition (bi-partition) of unlabelled data. MMC 
can be equivalently viewed as seeking the binary labelling of the data sample that will 
maximise the margin of an SVM estimated using the assigned labels. 

In a plethora of applications data can be collected cheaply and automatically, while 
labelling observations is a manual task that can be performed for a small proportion of the 
data only. Semi-supervised classifiers attempt to exploit the abundant unlabelled data to im¬ 
prove the generalisation error over using only the scarce labelled examples. Unlabelled data 
provide additional information about the marginal density, p(x), but this is beneficial only 
insofar as it improves the inference of the class conditional density, p{x.\y). Semi-supervised 
classification relies on the assumption that a relationship between p(x) and p(x|y) exists. 
The most frequently assumed relationship is that high-density clusters are associated with 
a single class (cluster assumption), or equivalently that class boundaries pass through low- 
density regions (low-density separation assumption). The most widely used semi-supervised 
classifier based on the low-density separation assumption is the semi-supervised support vec¬ 
tor machine (S^VM) (Vapnik and Sterin, 1977; Joachims, 1999; Chapelle and Zien, 2005). 
S^VMs implement the low-density separation assumption by partitioning the data according 
to the maximum margin hyperplane with respect to both labelled and unlabelled data. 
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Encouraging theoretical results for semi-supervised classification have been obtained un¬ 
der the cluster assumption. If p(x) is a mixture of class conditional distributions, Castelli 
and Cover (1995, 1996) have shown that the generalisation error will be reduced exponen¬ 
tially in the number of labelled examples if the mixture is identifiable. More recently, Singh 
et al. (2009) showed that the mixture components can be identified if p(x) is a mixture of a 
finite number of smooth density functions, and the separation between mixture components 
is large. Rigollet (2007) considers the cluster assumption in a nonparametric setting, that 
is in terms of density level sets, and shows that the generalisation error of a semi-supervised 
classifier decreases exponentially given a sufficiently large number of unlabelled data. How¬ 
ever, the cluster assumption is difficult to verify with a limited number of labelled examples. 
Furthermore, the algorithms proposed by Rigollet (2007) and Singh et al. (2009) are difficult 
to implement efficiently even if the cluster assumption holds. This renders them impractical 
for real-world problems (Ji et al., 2012). 

Although intuitive, the claim that maximising the margin over (labelled and) unlabelled 
data is equivalent to identifying the hyperplane that goes through regions with the lowest 
possible probability density has received surprisingly little attention. The work of Ben- 
David et al. (2009) is the only attempt we are aware of to theoretically investigate this 
claim. Ben-David et al. (2009) quantify the notion of a low-density separator by defining 
the density on a hyperplane, as the integral of the probability density function along the 
hyperplane. They study the existence of universally consistent algorithms to compute the 
hyperplane with minimum density. The maximum hard margin classifier is shown to be 
consistent only in one dimensional problems. In higher dimensions only a soft-margin 
algorithm is a consistent estimator of the minimum density hyperplane. Ben-David et al. 
(2009) do not provide an algorithm to compute low density hyperplanes. 

This paper introduces a novel approach to clustering and semi-supervised classification 
which directly identifies low-density hyperplanes in the finite sample setting. In this ap¬ 
proach the density on a hyperplane criterion proposed by Ben-David et al. (2009) is directly 
minimised with respect to a kernel density estimator that employs isotropic Gaussian ker¬ 
nels. The density on a hyperplane provides a uniform upper bound on the value of the 
empirical density at points that belong to the hyperplane. This bound is tight and propor¬ 
tional to the density on the hyperplane. Therefore, the smallest upper bound on the value 
of the empirical density on a hyperplane is achieved by hyperplanes that minimise the den¬ 
sity on a hyperplane criterion. An important feature of the proposed approach is that the 
density on a hyperplane can be evaluated exactly through a one-dimensional kernel density 
estimator, constructed from the projections of the data sample onto the vector normal to 
the hyperplane. This renders the computation of minimum density hyperplanes tractable 
even in high dimensional applications. 

We establish a connection between the minimum density hyperplane and the maximum 
margin hyperplane in the finite sample setting. In particular, as the bandwidth of the kernel 
density estimator is reduced towards zero, the minimum density hyperplane converges to 
the maximum margin hyperplane. An intermediate result establishes that there exists a 
positive bandwidth such that the partition of the data sample induced by the minimum 
density hyperplane is identical to that of the maximum margin hyperplane. 

The remaining paper is organised as follows: The formulation of the minimum den¬ 
sity hyperplane problem as well as basic properties are presented in Section 2. Section 3 
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establishes the connection between minimum density hyperplanes and maximum margin 
hyperplanes. Section 4 discusses the estimation of minimum density hyperplanes and the 
computational complexity of the resulting algorithm. Experimental results are presented in 
Section 5, followed by concluding remarks and future research directions in Section 6. 


2. Problem Formulation 


We study the problem of estimating a hyperplane to partition a hnite data set, X = 
C without splitting any of the high-density clusters present. We assume that X 
is an i.i.d. sample of a random variable X on with unknown probability density function 
p : —7- M+. A hyperplane is dehned as H[\r, b) := {x G | v • x = 6}, where without 

loss of generality we restrict attention to hyperplanes with unit normal vector, i.e., those 
parameterised by (v,6) G S'^~^ x M, where S'^~^ = {v G | ||v|| = 1}. Following Ben- 
David et al. (2009) we dehne the density on the hyperplane H{'v,b) as the integral of the 
probability density function along the hyperplane, 


p(x)dx. (1) 

J H{v.b) 

We approximate p(x) through a kernel density estimator with isotropic Gaussian kernels, 

1 ^ C W l|2 'I 

p{xU, ft V) = g “P { - I ■ (") 

This class of kernel density estimators has the useful property that the integral in Equa¬ 
tion (1) can be evaluated exactly by projecting X onto v; constructing a one-dimensional 
density estimator with Gaussian kernels and bandwidth h; and evaluating the density at b, 


Iiv,b\X,h^I) := 


>H{v,b) 
1 


p {x\X, h^l) dx. 




exp 


(6 - V • Xj 
2/i2 


■| =p{b\{^-:xi}^=i,h^) ■ 


(3) 


The univariate kernel estimator p (• | {v • approximates the projected density 

on V, that is, the density function of the random variable, = X • v. Henceforth we 
use /(v,6) to approximate I{v,b). To simplify terminology we refer to /(v,6) as the den¬ 
sity on H{'v,b), or the density integral on H{v,b), rather than the empirical density, or 
the empirical density integral, respectively. For notational convenience we write I{v,b) for 
I(v, b\X, h'^I), where X and h are apparent from context. 

The following Lemma, adapted from (Tasoulis et ah, 2010, Lemma 3), shows that /(v, b) 
provides an upper bound for the maximum value of the empirical density at any point that 
belongs to the hyperplane. 


Lemma 1 Let X = C and p{x\X,h?I) be a kernel density estimator with 

isotropic Gaussian kernels. Then, for any (v,6) G S'^~^ x M, 

max p{x\X,hf I) ^ ^ i{w,b), for all x ^ H{^r,b). (4) 

xgiT(v,fe) 
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This lemma shows that a hyperplane, H{'v,b), cannot intersect level sets of the empirical 

l-d ^ 

density with level higher than (27r/i^) ^ i(v,6). The proof of the lemma relies on the 
fact that projection contracts distances, and follows from simple algebra. In Equation (4) 
equality holds if and only if there exists x G H{v,b) and c G such that all x* G Tf, 
can be written as Xj = x + CjV. It is therefore not possible to obtain a uniform upper 
bound on the value of the empirical density at points that belong to H{v,b) that is lower 

l-d .. 

than (271 ^ j(v,6) using only one-dimensional projections. Since the upper bound of 
Lemma 1 is tight and proportional to /(v, 6), minimising the density on the hyperplane 
leads to the lowest upper bound on the maximum value of the empirical density along the 
hyperplane separator. 

To obtain hyperplane separators that are meaningful for clustering and semi-supervised 
classification, it is necessary to constrain the set of feasible solutions, because the density 
on a hyperplane can be made arbitrarily low by considering a hyperplane that intersects 
only the tail of the density. In other words, for any v, I(v,6) can be made arbitrarily 
low for sufficiently large \b\. In both problems the constraints restrict the feasible set to a 
subset of the hyperplanes that intersect the interior of the convex hull of X. In detail, let 
convTf denote the convex hull of X, and assume Int(convTf) ^ 0. Define C to be the set 
of hyperplanes that intersect Int(convTf), 


C = |lI(v, 6) (v, b) e S'^ ^ X M, 3z G Int(conv Tf) s.t. v • z = bj . 


(5) 


Then denote by F the set of feasible hyperplanes, where F C C. We define the minimum 
density hyperplane (MDH), H{v*,b*) £ F to satisfy. 


/(v*,6*) = 


min j(v,6). 
(v,b)|L7(v,fe)eF 


( 6 ) 


In the following subsections we discuss the specific formulations for clustering and semi- 
supervised classification in turn. 


2.1 Clustering 

Since high-density clusters are formed around the modes of p(x), the convex hull of these 
modes would be a natural choice to define the set of feasible hyperplanes. Unfortunately, 
this convex hull is unknown and difficult to estimate. We instead propose to constrain 
the distance of hyperplanes to the origin, b. Such a constraint is inevitable as for any 
V G /(v, 6) can become arbitrarily close to zero for sufficiently large |6|. Obviously, 

such hyperplanes are inappropriate for the purposes of bi-partitioning as they assign all 
the data to the same partition. Rather than fixing 6 to a constant, we constrain it in the 
interval, 

F(v) = [^v - acJv, + otav ], (7) 

where /Xv and denote the mean and standard deviation, respectively, of the projections 
{v-Xj}”^^. The parameter a ^ 0, controls the width of the interval, and has a probabilistic 
interpretation from Chebyshev’s inequality. Smaller values of a favour more balanced par¬ 
titions of the data at the risk of excluding low density hyperplanes that separate clusters 
more effectively. On the other hand, increasing a increases the risk of separating out only a 
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few outlying observations. We discuss in detail how to set this parameter in the experimen¬ 
tal results section. If Int(convAf) / 0, then there exists a > 0 such that the set of feasible 
hyperplanes for clustering, Fcl, satisfies, 


-Fcl 



(v,6) xM, 6gF(v)} cC, 


( 8 ) 


where C is the set of hyperplanes that intersect Int(convdf), as defined in Equation (5). 

The minimum density hyperplane for clustering is the solution to the following con¬ 
strained optimisation problem. 


min j(v,6), (9a) 

(v,6)e5'*-ixR 

subject to; b — fiy, + aa^ ^ 0, (9b) 

+ aiTv — 6^0. (9c) 


Since the objective function and the constraints are continuously differentiable, MDHs can 
be estimated through constrained optimisation methods like sequential quadratic program¬ 
ming (SQP). Unfortunately the problem of local minima due to the nonconvexity of the 
objective function seriously hinders the effectiveness of this approach. 

To mitigate this we propose a parameterised optimisation formulation, which gives rise 
to a projection pursuit approach. Projection pursuit methods optimise a measure of “inter¬ 
estingness” of a linear projection of a data sample, known as the projection index. For our 
problem the natural choice of projection index for v is the minimum value of the projected 
density within the feasible region, min^g^j-v) /(v, b). This index gives the minimum density 
integral of feasible hyperplanes with normal vector v. To ensure the differentiability of the 
projection index we incorporate a penalty term into the objective function. We define the 
penalised density integral as. 


/cl(v, 6) = /(v,6) -h 


— max {0, Uv — 

rje 


b,b — Hv — CUTv} 


l+e 


( 10 ) 


where, L = ^ ^ sup^giK 


dljvfi) 

db 


G (0,1) is a constant term that ensures 


that the penalty function is everywhere continuously differentiable, and r] G (0,1). Other 
penalty functions are possible, but we only consider the above due to its simplicity, and 
the fact that its parameters offer a direct interpretation: L in terms of the derivative of the 
projected density on v; and r] in terms of the desired accuracy of the minimisers of /cl(v, b) 
relative to the minimisers of Equation (9), as discussed in the following proposition. 


Proposition 2 For v G 5'^ define, the set of minimisers, 

i?(v) = argmin/(v, 6), (11) 

beF(v) 

Bciv) = argmin/cL(v,6) (12) 

beM 

For every b* G B{v) there exists bf, G i?c(v) such that \b* — bf,\ ^ rj. Moreover, there are 

no minimisers of /cl(v, b) outside the interval [yiv — auv — r], iXy + aa^ + f]], 

Bc{^) n - auv -h-,Fv + auv -k ??] = 0. 
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Proof 

Any minimiser in the interior of the feasible region, b* £ i?(v)nInt(-F(v)), also minimises 
the penalised function, since /cl(v, 6) = I(v,6) for all b £ Int(F(v)), hence b* £ Bc{v). 

Next we consider the case when either or both of the boundary points of F{v), b~ = — 

ofTv and b'^ = + are contained in B{y). It suffices to show that, /cl(v, h) > I(v, b~) 

for all 6 < 6“ — r/, and /cl(v,6) > I{yr,b'^) for all b > b'^ + rj. We discuss only the case 
6 >&■*■ + ?7 as the treatment of 6 < 6 “ — 77 is identical. Assume that /(v, b) < /(v, b'^) (since 
in the opposite case the result follows immediately: /cl(v, 6) > /(v,6) > I(v,6+)). From 
the mean value theorem there exists ^ G {b'^, b) such that. 


/(v,6+) = /(v,6) 


ib-b+) 


5/(v, b) 

db 


b=i 


^/(v,6) + (6-6+)L 

jjE 


In the above we used the following facts: 


di(v,b) 

db 


< 0, L ^ SUPfegK 


di(v,b) 

db 


and ^ > 1. 


We define the projection index for the clustering problem as the minimum of the pe¬ 
nalised density integral, 

4>clM = min /cl (v, 6). (13) 

oGK 

Since the optimisation problem of Equation (13) is one-dimensional it is simple to compute 
the set of global minimisers Bc{'v). As we discuss in Section 4, this is necessary to compute 
directional derivatives of the projection index, as well as, to determine whether (/)cl is dif¬ 
ferentiable. We call the optimisation of /cL; minimum density projection pursuit (MDP^). 
For each v, MDP^ considers only the optimal choice of b. This enables it to avoid local 
minima of /(v, •). Most importantly MDP^ is able to accommodate a discontinuous change 
in the location of the global minimiser(s), argmin^gj^ /cl(v, 6), as v changes. Neither of 
the above can be achieved when the optimisation is jointly over (v, b) as in the original 
constrained optimisation problem. Equation (9). The projection index /cl is continuous, 
but it is not guaranteed to be everywhere differentiable when Bc{^) is not a singleton. The 
resulting optimisation problem is therefore nonsmooth and nonconvex. 

To illustrate the effectiveness of MDP^ to estimate MDHs, we compare this approach 
with a direct optimisation of the constrained problem given in Equation (9) using SQP. To 
enable visualisation we consider the two-dimensional SI data set (Franti and Virmajoki, 
2006), constructed by sampling from a Gaussian mixture distribution with fifteen compo¬ 
nents, where each component corresponds to a cluster. Figure 1 depicts the MDHs obtained 
over 100 random initialisations of SQP and MDP^. It is evident that SQP frequently yields 
hyperplanes that intersect regions with high probability density thus splitting clusters. As 
SQP always converged in these experiments the poor performance is solely due to conver¬ 
gence to local minima. In contrast, MDP^ converges to three different solutions over the 100 
experiments, all of which induce high quality partitions, and none intersects a high-density 
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Figure 1: Binary partitions induced by 100 MDHs estimated through SQP and MDP^ 


Projection Index: (1 )cl{0) 



■ SQP 
□ MDP^ 


J ■ I li I 


0 Jr/4 



7t/2 3 jc/4 

Projection Angle: 0 


Figure 2: Projection index for SI data set and solutions obtained 


through SQP and MDP^ 


cluster. In polar coordinates any v G 5^ can be parameterised through a single projection 
angle. Using this parameterisation, the upper plot of Figure 2 depicts the value of the 
projection index, (^cL(v(6i)), for 6 G [0, tt]. The lower plot of the figure provides histograms 
of the distribution of the solutions (locally optimal projection angles) obtained over the 100 
experiments with SQP (grey) and MDP^ (white). The figure shows that i?icL(v) is con¬ 
tinuous but not everywhere differentiable. The solution most frequently obtained through 
MDP^ corresponds to the global optimum, while the only other two solutions identified are 
the local minimisers with the next two lowest function values. In contrast SQP converges 
to a much wider range of solutions. Note that this method is not guaranteed to identify the 
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optimal value of b for any v(0) and this indeed occurs in this example. Therefore the value 
of </>cl(v) is a lower bound for the function values of the minimisers identified through SQP. 

2.2 Semi-Supervised Classification 

In semi-supervised classihcation labels are available for a subset of the data sample. The 
resulting classifier needs to predict as accurately as possible the labelled examples, while 
avoiding intersection with high-density regions of the empirical density. The MDH formula¬ 
tion can readily accommodate partially labelled data by incorporating the linear constraints 
associated with the labelled data into the clustering formulation. Without loss of generality 
assume that the first i examples are labelled by y = (yi,... , ye )'^ G {“1; ■ The MDH 

for semi-supervised classification is the solution to the problem, 

min I{v,b), (14a) 

(v,fe)e5'*-ixR 

subject to: yj(v • x* — 6 ) ^ 0, Vi = 1,..., £, (14b) 

b — fiv + acJv ^ 0, (14c) 

/iv + a<Tv — & ^ 0, (14d) 

where I{v, b), , and Uv are computed over the entire data set. If the labelled examples 
are linearly separable the constraints in Equation (14) define a nonempty feasible set of 
hyperplanes, 

Flb = |i?(v, 6 ) I (v, 6 ) G 5"*“^ X M ,6 G F(v),yi(v • Xj - 6 ) ^ 0, Vi G {1,... ,i}| C C. 

(15) 

Equations (14c) and (14d) act as a balancing constraint which discourages MDHs that 
classify the vast majority of unlabelled data to a single class. Balancing constraints are 
included in the estimation of S^VMs for the same reason (Joachims, 1999; Chapelle and 
Zien, 2005). 

As in the case of clustering, the direct minimisation of Equation (14) frequently leads 
to locally optimal solutions. To mitigate this we again propose a projection pursuit formu¬ 
lation. We define the penalised density integral for semi-supervised classification as, 

i 

/ssc(v, 6 ) = /cl(v, 6 ) -F 7 ^max{ 0 ,-yi(v • x* - b)Y^^ (16) 

i=l 

where, 7 > 0 is a user-defined constant, which controls the trade-off between reducing the 
density on the hyperplane, and misclassifying the labelled examples. The projection index 
is then defined as the minimum of the penalised density integral, 

0ssc(v) = min/ssc(v, 6 ). (17) 

3. Connection to Maximnm Margin Hyperplanes 

In this section we discuss the connection between MDHs and maximum (hard) margin 
hyperplane separators. The margin of a hyperplane 77(v, b) with respect to a data set X is 
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defined as the minimum Euclidean distance between the hyperplane and its nearest datum, 

margin H (v, b) = min | v • x — 6|. (18) 

xSA” 

The points whose distance to the hyperplane H{'v,b) is equal to the margin of the hyper¬ 
plane, that is, arg | v-x—6|, are called the support points of H (v, b). Let F denote the 

set of feasible hyperplanes; then the maximum margin hyperplane (MMH), G F 

satisfies, 

margin iL(v™', ft*”) = max margin iL(v, 6). (19) 

(v,fe)|H(v,6)eF 

The main result of this section is Theorem 5, which states that as the bandwidth pa¬ 
rameter, h, is reduced to zero the MDH converges to the MMH. An intermediate result. 
Lemma 4, shows that there exists a positive bandwidth, h' > such that, for all h G (0, h'), 
the partition of the data set induced by the MDH is identical to that of the MMH. 

We first discuss some assumptions which allow us to present the theoretical results of 
this section. As before we assume a fixed and finite data set X C and approximate 
its (assumed) underlying probability density function via a kernel density estimator using 
Gaussian kernels with isotropic bandwidth matrix h^L We assume that the interior of the 
convex hull of the data, Int(conv A), is non-empty, and define C as the set of hyperplanes 
that intersect Int(conv X), as in Equation (5). The set of feasible hyperplanes, F, for 
either clustering or the semi-supervised classification satisfies F C C. By construction 
every F[{'v,b) G F defines a hyperplane which partitions X into two non-empty subsets. 
Observe that if for each v G the set {b G M | Lf(v, 6) G F} is compact, then by the 

compactness of 5*^“^ a maximum margin hyperplane in F exists. For both the clustering 
and semi-supervised classihcation problems this compactness holds by construction. 

For any h > 0, let (v^,6)() G x M parameterise a hyperplane which achieves the 

minimal density integral over all hyperplanes in F, for bandwidth matrix h?F That is, 

= , ,,min /(v,6). (20) 

{v,b)\H(v,b)&F 

Following the approach of Tong and Roller (2000) we first show that as the bandwidth, h, 
is reduced towards zero, the density on a hyperplane is dominated by its nearest point. 
This is achieved by establishing that for all sufficiently small values of h, a hyperplane with 
non-zero margin has lower density integral than any other hyperplane with smaller margin. 

Lemma 3 Take F[{v,b) G F with non-zero margin and 0 < 5 < margin iL(v, 6) := Mv,fe. 
Then 3h' > 0 such that h G (0, h') and '■= margin iL(w, c) ^ — 5 implies 

I{^r,b) < i{w,c). 

Proof 

Using Equation (3) it is easy to see that, 

/(v,6) ^ 

inf |i(w,c) I Mw,c ^ - (jj ^ 


1 


hy/^ 

1 


h\/^ 


exp 


exp 


n. 
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Therefore, 


0 ^ lim 

/i—>-0+ 


/(v,6) 


^ lim 


m2 , 

v.b 




inf |/(w,c) I Mw,c ^ - (5| exp |- | 


= 0 . 


Therefore, 3h' > 0 such that h G (0, h') 


inf 


{/(w,c) 


Tv.fc) 


^'w,c^^v,b 


2/l2 

< 1 . 


An immediate corollary of Lemma 3 is that as h tends to zero the margin of the MDH 
tends to the maximum margin. However, this does not necessarily ensure the stronger 
result that the sequence of MDHs converges to the MMH. To establish this we require two 
technical results, which describe some algebraic properties of the MMH, and are provided 
as part of the proof of Theorem 5 which is given in Appendix A. 

The next lemma uses the previous result to show that there exists a positive bandwidth, 
h' > 0, such that an MDH estimated using h G (0, h') induces the same partition of X as 
the MMH. The result assumes that the MMH is unique. Notice that if A is a sample of 
realisations of a continuous random variable then this uniqueness holds with probability 1. 

Lemma 4 Suppose there is a unique hyperplane in F with maximum margin, whieh can be 
parameterised by (v™,6™) G x M. Then 3h' > 0 s.t. h G (0,/i') => induces 

the same partition of X as 

Proof 

Let M = margin Lr(v”^, b'^), and let P be the collection of hyperplanes that induce the 
same partition of X as that induced by iL(v™,6™). Since X is finite and iL(v™',6™') is 
unique, 3(5 > 0 s.t. H{w,c) ^ P margin P(u;, c) ^ M — 6. By Lemma 3, 3h' > 0 s.t., 

h G (0,/i') ^ {Lf(w,c) I margin iL(w, c) ^ M — (5} , 

therefore P(v^,6^) G P. 


The next theorem is the main result of this section, and states that the MDH converges to 
the MMH as the bandwidth parameter is reduced to zero. Notice that by the non-unique 
representation of hyperplanes, the maximum margin hyperplane has two parameterisations 
in C, namely (v™, 6™) and (—v”^, —b^). Convergence to the maximum margin hyperplane 
is therefore equivalent to showing that, 

min{||K,6;;) - (v™,6™)||, \\i^rl,bl) + (v-,6-)||} ^ 0 as L ^ 0+. 

Theorem 5 Suppose there is a unique hyperplane in F with maximum margin, which can 
he parameterised by (v™',6™') G S'^~^ x M. Then, 

lim min{UK, 6*,) - (v-,6™)||, ||K,^*.) + = 0. 
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The set F used in Theorem 5 is generic so it can capture the constraints associated with 
both clustering and semi-supervised classification, Equations (9) and (14) respectively. In 
the case of semi-supervised classification we must also assume that the labelled data are 
linearly separable. Theorem 5 is not directly applicable to the MDP^ formulations as 
in this case the function being minimised is not the density on a hyperplane. The next 
two subsections establish this result for the MDP^ formulation of the clustering and semi- 
supervised classification problem. 

3.1 MDP^ for Clustering 

We have shown that for the constrained optimisation formulation the MDH converges to 
the MMH within the feasible set, Tcl C C. In addition, for a fixed v. Proposition 2 bounds 
the distance between minimisers of the penalised function /cl, argmin^gj^ /cl(v, 6), and 
the optimal b of the constrained problem, argmin^g^j-.^,) /(v, 6). Combining these we can 
show that the optimal solution to the penalised MDP^ formulation converges to the maxi¬ 
mum margin hyperplane in Tcl, provided the parameters within the penalty term suitably 
depend on the bandwidth parameter, h. While the general case can be shown, for ease 
of exposition we make the simplifying assumption that the maximum margin hyperplane 
is strictly feasible, i.e., if (v™,6™) parameterises the maximum margin hyperplane then 
fe™' G — acTv™, /Xv"! + aa^m.). 

For h,rj, L > 0 define (v^ n L’^hr] l) fo be any global minimiser of /cl, he., 

/cl(v,6). 

' ' (v,b)e5‘*-ixR 


Lemma 6 Suppose there is a unique hyperplane in Fql with maximum margin, which can 
be parameterised by (v™',6™') G 5*^“^ x M. Suppose further that 6™ G 
aa^m). For h > 0, let L{h) = , and 0 < r]{h) ^ h. Then, 


lim miniII (v 
h—>-0+ 


-k 

h,r](h)^L(h) 


^h,r]{h),L{h) 


)-(v™,6" 


^h^r{(h),L{h) 


K,r){h),L(h)^ + 


mi} = o. 


Proof 

Let M = marginLf(v™', ft™') and as in the proof of Lemma 4, let ft > 0 be such that 
any hyperplane inducing a different partition from Lf(v™,ft™) has margin at most M — ft. 
Consider the set F^^ ;= {(v,ft) G x M|ft G B 5 / 2 (-^(v))}, where we used the notation 
]B 5 / 2 (E(v)) to denote the neighbourhood of E(v) given by {r G M|(i(r, F(v)) < ft/2}. The 
set Fql increases the feasible set of hyperplanes by allowing ft to range in ft G 185/2 (-^(v)). 
For any fixed v, the maximum margin of all hyperplanes with normal vector v can increase 
by at most ft/2. Thus, any hyperplane inducing a different partition compared to Lf(v™, ft™) 
has a margin at most M — ft/2. Since Lf(vm, bm) is strictly feasible it therefore remains the 
unique maximum margin hyperplane in Fc^- Observe now that for 0 < ft. < ft/2 we have 

^ by Proposition 2. In addition, by Theorem 5, we know 

that the minimisers of /(v, ft) over say H{w{,b{), satisfy 


lim min 

/i—^0+ 


{ 


(vtftl)-(v™,ft" 




bi) + (v™ 



= 0 . 
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Figure 3: Convergence of the MDH to the maximum margin hyperplane for a decreasing 
sequence of bandwidth parameters, h. 


Now, since 6"*) is strictly feasible 3e' > 0 s.t. (v, b) G ]Bg/({(v™', 6™), —(v™-, 6™')}) =► 

i/(v,6) G Fcl- Then for any 0 < e < e' there exists h' > 0 s.t. for 0 < h < /i' both 
K.l-l) £ 6”),-(v”,(.“))) ^ € Fcl and ^6U,w,i(i)) £ 

Fql- Now for H{v, b) G \ Fql we know that /(v, b) < /cl(v, b), whereas for H{v, b) G 
FcL,Ii^, b) = /cl(v, b) and therefore the minimiser of /cl(v, b) must lie in the neighbour¬ 
hood ]B£({(v”^, 6™'), —(v™", 6™)}), and the result follows. 


To illustrate the convergence of the MDH to the MMH we use the two-dimensional 
data set shown in Figure 3. The data is sampled from a mixture of two Gaussian dis¬ 
tributions with equal covariance matrix. The MDH with respect to the true underlying 
density is ((1, — 1), 0). A large margin separator is artificially introduced by removing a 
few observations in a narrow margin around a hyperplane different from H ((1, —1), 0). The 
margin is intentionally small to ensure that identifying the MMH is non-trivial. Figure 3 
illustrates the MDH solutions arising from the MDP^ method for a decreasing sequence of 
bandwidths, h. Initially the MDH approximately coincides with the optimal MDH with 
respect to the true density of the Gaussian mixture. As h decreases, the MDH approaches 
the MMH and for the smallest values of h the two are indistinguishable. 

3.2 MDP^ for Semi-Supervised Classification 

Denote the set of hyperplanes which correctly classify the labelled data by Flb- Under the 
assumption that 3H{'v, b) G Flb G Fcl with non-zero margin, we can show that, provided 
the parameter 7 does not shrink too quickly with /i, the hyperplane that minimises /ssc 
converges to the MMH contained in UlbIGFcl, where as before we assume that such an MMH 
is strictly feasible. To establish this result it is sufficient to show that there exists F > 0 such 
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that for all h G ( 0 , h'), the optimal hyperplane H (v^ rj L'y) correctly classifies all the 

labelled examples. If this holds, then ^11 

sufficiently small h, and hence Lemma 6 can be applied to establish the result. The proof 
relies on the fact that the penalty terms associated with the known labels in Equation (16) 
are polynomials in b. Provided that 7 is bounded below by a polynomial in h, the value of 
the penalty terms for hyperplanes that do not correctly classify the labelled data dominate 
the value of the density integral as h approaches zero. Therefore the optimal hyperplane 
must correctly classify the labelled data for small values of h. 


Lemma 7 Define Flb = (v, b)\yi{'v-Xi — b) > 0, Vi = and Fql = 6)|;Uv — 

auv ^ b ^ fiv + auv} and assume that Fssc = Flb H Fql 0 and that 3LI(v, b) G Tssc 
with non-zero margin. For h > 0, let L{h) = 0 < r]{h) ^ h and 7 (/i) ^ 

for some r > 0. Then 3h' > 0 s.t h G (0,h') ^ G Flb- 


Proof 

Consider H{^,h) 0 Flb- Then, 

/ssc(v, 6 ) ^ l- 

ny iTih 


exp(-jy2/2/i2) + -i{h)ul+^ > -i{h)vl+\ 


where Ui, > 0 minimises exp(—i/^/2/i^)+7(/i)zz^^'^. Therefore, r'* is the unique positive 

number satisfying, 

“p {-^) (-f) + " 

= (1 + e)'j{h)nV^h^ exp ^ 


2h? 


We therefore have. 


i/* ^ ((1 + e)^{h)nV^hfi^ 


/ssc(v,6) > 7(h) ((1 + e)7(/i)n\/^/i^y ' 


2 3(1+7 

= K^{h)^-'^h 1 -' 

2r+3(l+£) 

^ Kh 1 -^ , 


where iV is a constant which can be chosen independent of (v, 6 ). Finally, for any F[{v',b') G 
Tssc with non-zero margin, 3h' > 0 s.t. 

, , , - , , 2r+3(l + e) 

/i G (0,/i) ^/ssc(v ,6 ) =/(v ,6 ) < iVh 1 -= </ssc(v, 6 ). 

Since K is independent of (v, b), the result follows. The final set of inequalities holds since 
the hyperplane Lf(v', 6 ') is assumed to have non-zero margin, say M^f y > 0 , and hence 
/(v', 6 ') ^ exp{—M v' 7 '/ 2 /i-^}, which tends to zero faster than any polynomial in h. 
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4. Estimation of Minimum Density Hyperplanes 

In this section we discuss the computation of MDHs. We first investigate the continuity and 
differentiability properties required to optimise the projection indices (/>cl(v) and </)ssc(v). 

Since the domain of both projection indices, (?i>cL(v) and (/>ssc(v), is the boundary of 
the unit-sphere in it is more convenient to express v in terms of spherical coordinates, 


Vi{9) 


cos(6'j) 0}=! sm{0j), i = l,...,d-l 
0^=1 sin(6'j), i = d, 


( 21 ) 


where 0 G 0 = [0,7r]^“^ x [0, 27r] is called the projection angle. Using spherical coordinates 
renders the domain, 0, convex and compact, and reduces dimensionality by one. 

As the following discussion applies to both i^cl(v) and (/>ssc(v) we denote a generic 
projection index (p : Q ^ M., and the associated set of minimisers, as. 


(/)(6») = min/(v(6»),6), (22) 

beA 

Bi9) = {b€A\fivie),b) = m}, (23) 

where /(v(0),6) is continuously differentiable, A C M is compact and convex, and the 
correspondence B{9) gives the set of global minimisers of f{v{9), b) for each 9. The definition 
of A is not critical in our formulation. Setting, 


A D 


min {^v} - ao-pc. 
vGS^~^ 


r], max {^v} + aUpc. r/ , 
ve.S'*-! 


(24) 


where is the variance of the projections along the first principal component, ensures 
that the set of hyperplanes that satisfy the constraint of Equation (7) will be a subset of A 
for all V. 

Berge’s maximum theorem (Berge, 1963; Polak, 1987), establishes the continuity of j>(9) 
and the upper-semicontinuity (u.s.c.) of the correspondence B{9). Theorem 3.1 in (Polak, 
1987) enables us to establish that (l){9) is locally Lipschitz continuous. Using Theorem 4.13 
of Bonnans and Shapiro (2000) we can further show that <i){9) is directionally differentiable 
everywhere. The directional derivative at 9 in the direction v is given by. 


d(j){e-, v) = ^mn^ Def{^r{9), b) • u, (25) 

where Dq denotes the derivative with respect to 9. It is clear from Equation (25) that 0(0) 
is differentiable if Dgf{v{9), b) is the same for all b £ B{9). If B{9) is a singleton then this 
condition is trivially satished and 0(0) is continuously differentiable at 0. 

It is possible to construct examples in which B{9) is not a singleton. However, with the 
exception of contrived examples, our experience with real and simulated data sets indicates 
that when h is set through standard bandwidth selection rules B[9) is almost always a 
singleton over the optimisation path. 


Proposition 8 Suppose B{9) is a singleton for almost all 9 £ Q. Then 0(0) is continuously 
differentiable almost everywhere. 
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Proof The result follows immediately from the fact that if B{9) = {6} is a singleton, then 
the derivative D(p{6) = Dgf{v{9),b), which is continuous. ■ 

Wolfe (1972) has provided early examples of how standard gradient-based methods can 
fail to converge to a local optimum when used to minimise nonsmooth functions. In the 
last decade a new class of nonsmooth optimisation algorithms has been developed based on 
gradient sampling (Burke et al., 2006). Gradient sampling methods use generalised gradient 
descent to find local minima. At each iteration points are randomly sampled in a radius e of 
the current candidate solution, and the gradient at each point is computed. The convex hull 
of these gradients serves as an approximation of the e-Clarke generalised gradient (Burke 
et al., 2002). The minimum element in the convex hull of these gradients is a descent 
direction. The gradient sampling algorithm progressively reduces the sampling radius so 
that the convex hull approximates the Clarke generalised gradient. When the origin is 
contained in the Clarke generalised gradient there is no direction of descent, and hence 
the current candidate solution is a local minimum. Gradient sampling achieves almost sure 
global convergence for functions that are locally Lipschitz continuous and almost everywhere 
continuously differentiable. It is also well documented that it is an effective optimisation 
method for functions that are only locally Lipschitz continuous. 

4.1 Computational Complexity 

In this subsection we analyse the computational complexity of MDP^. At each iteration the 
algorithm projects the data sample onto v(0) which involves 0{nd) operations. To compute 
the projection index, (l){9), we need to minimise the penalised density integral, f{v{9),b). 
This can be achieved by first evaluating f{v{9),b) on a grid of m points, to bracket the 
location of the minimiser, and then applying bisection to compute the minimiser(s) within 
the desired accuracy. The main computational cost of this procedure is due to the first 
step which involves m evaluations of a kernel density estimator with n kernels. Using the 
improved fast Gauss transform (Morariu et al., 2008) this can be performed in 0{m + 
n) operations, instead of 0{mn). Bisection requires 0(—log 2 e) iterations to locate the 
minimiser with accuracy e. 

If the minimiser of the penalised density integral 6* = argmin^g^/(v(0), 6), is unique the 
projection index is continuously differentiable at 9. To obtain the derivative of the projection 
index it is convenient to define the projection function, P(v) = (xi • v,..., x„ • v)"*^. An 
application of the chain rule yields, 

de^ = Def{^{9),b*) = Dpf{w{9), b*)D^PDe^ (26) 

where the derivative of the projections of the data sample with respect to v is equal to 
the data matrix, D^P = (xi,... ,x„)''^; and Dgv is the derivative of v with respect to the 
projection angle, which yields a d x (d — 1) matrix. The computation of the derivative 
therefore requires 0{d{n + d)) operations. 

The original GS algorithm requires 0{d) gradient evaluations at each iteration which 
is costly. Curtis and Que (2013) have developed an adaptive gradient sampling algorithm 
that requires 0(1) gradient evaluations in each iteration. More recently, Lewis and Overton 
(2013) have strongly advocated that for the minimisation of nonsmooth, nonconvex, locally 
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n 

d 

c 

banknote^ 

1372 

4 

2 

br. cancer“ 

699 

9 

2 

forest“ 

523 

27 

4 

ionosphere® 

351 

33 

2 

optidigits® 

5618 

64 

10 

pendigits® 

10992 

16 

10 

seeds® 

210 

7 

3 

smartphone ® 

10929 

561 

12 

image seg.® 

2309 

18 

7 

satellite® 

6435 

36 

6 

synth® 

600 

60 

6 

voting® 

435 

16 

2 

wine® 

178 

13 

3 

yeast^ 

698 

72 

5 


a. UCI machine learning repository https://archive.ics.uci.edu/inl/datasets.htinl 

b. Stanford Yeast Cell Cycle Analysis Project http://genome-www.stanford.edu/cellcycle/ 

Table 1; Details of benchmark data sets: size (n), dimensionality (d), number of clusters (c). 


Lipschitz functions, a simple BFGS method using inexact line searches is much more efficient 
in practice than gradient sampling, although no convergence guarantees have been estab¬ 
lished for this method. BFGS requires a single gradient evaluation at each iteration and a 
matrix vector operation to update the Hessian matrix approximation. In our experiments 
we use the BFGS algorithm. 

5. Experimental Results 

In this section we assess the empirical performance of MDHs for clustering and semi- 
supervised classihcation. We compare performance with existing state-of-the-art meth¬ 
ods for both problems on the following 14 benchmark data sets; Banknote authentication 
(banknote). Breast Cancer Wisconsin original (br. cancer), Forest type mapping (forest). 
Ionosphere, Optical recognition of handwritten digits (optidigits). Pen-based recognition of 
hand-written digits (pendigits). Seeds, Smartphone-Based Recognition of Human Activities 
and Postural Transitions (smartphone), Statlog Image Segmentation (image seg.), Statlog 
Landsat Satellite (satellite), Synthetic control chart time series (synth control). Congres¬ 
sional voting records (voting). Wine, and Yeast cell cycle analysis (yeast). Details of these 
data sets, in terms of their size, n, dimensionality, d and number of clusters, c, can be seen 
in Table 1. 

5.1 Clustering 

Since an MDH yields a bi-partition of a data set rather than a complete clustering, we 
propose two measures to assess the quality of a binary partition of a data set containing an 
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arbitrary number of clusters. Both take values in [0,1] with larger values indicating a better 
partition. These measures are motivated by the fact that a good binary partition should (a) 
avoid dividing clusters between elements of the partition, and (b) be able to discriminate 
at least one cluster from the rest of the data. To capture this we modify the cluster labels 
of the data by assigning each cluster to the element of the binary partition which contains 
the majority of its members. In the case of a tie the cluster is assigned to the smaller of the 
two partitions. We thus merge the true clusters into two aggregate clusters, Ci and € 2 - 

The hrst measure we use is the binary V-measure which is simply the V-measure (Rosen¬ 
berg and Hirschberg, 2007) computed on Ci, C2 with respect to the binary partition, which 
we denote 111,112. The V-measure is the harmonic mean of homogeneity and complete¬ 
ness. For a data set containing clusters Ci,..., Cc, partitioned as Hi,..., 11^, homogeneity 
is defined as the conditional entropy of the cluster distribution within each partition, Ilj. 
Completeness is symmetric to homogeneity and measures the conditional entropy of each 
partition within each cluster, Cj. An important characteristic of the V-measure for evaluat¬ 
ing binary partitions is that if the distribution of clusters within each partition is equal to the 
overall cluster distribution in the data set then the V-measure is equal to zero (Rosenberg 
and Hirschberg, 2007). This means that if an algorithm fails to distinguish the majority 
of any of the clusters from the remainder of the data, the binary V-measure returns zero 
performance. Other evaluation metrics for clustering, such as purity and the Rand index, 
can assign a high value to such partitions. 

To define the second performance measure we hrst determine the number of correctly 
and incorrectly classihed samples. The error of a binary partition, E(ni,n 2 ), given in 
Equation (27), is dehned as the number of elements of each aggregate cluster which are 
not in the same partition as the majority of their original clusters. In contrast, the success 
of a partition, S(ni,n 2 ), Equation (28), measures the number of samples which are in the 
same partition as the majority of their original clusters. The Success Ratio, SR(ni,n 2 ), 
Equation (29), captures the extent to which the majority of at least one cluster is well- 
distinguished from the rest of the data. 

E(ni,n2) = min{|ninCi| + |n2nC2|,|ninC'2| + |n2nc'i|}, (27) 

S(ni,n 2 ) = min{max{|ni n Cil, iHi n (721} ,max{|n 2 n (7i|, |n 2 n (72|}} , (28) 

S(ni,n 2 ) 

S(ni,n2) + E(ni,n2)‘ 

The Success Ratio takes the value zero if an algorithm fails to distinguish the majority of 
any cluster from the remainder of the data. 

5.1.1 Parameter Settings for MDP^ 

The two most important settings for the performance of the proposed approach are the 
initial projection direction, and the choice of a, which controls the width of the interval F{^^) 
within which the optimal hyperplane falls. Despite the ability of the MDP^ formulation to 
mitigate the effect of local minima of the projected density, the problem remains non-convex 
and local minima in the projection index can still lead to suboptimal performance. We have 
found that this effect is amplified in general when either or both the number of dimensions, 
and the number of high density clusters in the data set is large. To better handle the effect 
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of local optima, we use multiple initialisations and select the MDH that maximises the 
relative depth criterion, defined in Equation (30). The relative depth of an MDH, H{v,b), 
is defined as the smaller of the relative differences in the density on the MDH and its two 
adjacent modes in the projected density, 

min < /(v, mi), /(v, m^) > — /(v, h) 

RelativeDepth(v, b) =-^^- - - (30) 

I{v,b) 

where m; and are the two adjacent modes in the projected density on v. If an MDH 
does not separate the modes of the projected density, then its relative depth is set to zero, 
signalling a failure of MDP^ to identify a meaningful bi-partition. The relative depth is 
appealing because it captures the fact that a high quality separating hyperplane should 
have a low density integral, and separate well the modes of the projected density. Note 
also that the relative depth is equivalent to the inverse of a measure used to define cluster 
overlap in the context of Gaussian mixtures (Aitnouri et ah, 2000). In all the reported 
experiments we initialise MDP^ to the first and second principal component and select the 
MDH with the largest relative depth. For the data sets listed above it was never the case 
that both initialisations led to MDHs with zero relative depth. 

The choice of a determines the trade-off between a balanced bi-partition and the ability 
to discover lower density hyperplanes. The difficulties associated with choosing this pa¬ 
rameter are illustrated in Figure 4. In each sub-figure the horizontal axis is the candidate 
projection vector, v, while the right vertical axis is the direction of maximum variability or¬ 
thogonal to V. Points correspond to projections of the data sample onto this two-dimensional 
space, while colour indicates cluster membership. The solid line depicts the projected den¬ 
sity on V, while the dotted line depicts the penalised function, /cl(v, •). The scale of both 
functions is depicted on the left vertical axis. The solid vertical line indicates the MDH 
along V. Setting a to a large value can cause MDP^ to focus on hyperplanes that have low 
density because they partition only a small subset of the data set as shown in Figure 4(a). 
In contrast smaller values of a may cause the algorithm to disregard valid lower density 
hyperplane separators (see Figure 4(b)), or for the separating hyperplane to not be a local 
minimiser of the projected density (see Figure 4(c)). 

Rather than selecting a single value for a we recommend solving MDP^ repeatedly for an 
increasing sequence of values in the range {amin, cimax}, where each implementation beyond 
the first is initialised using the solution to the previous. Setting amin close to zero forces 
MDP^ to seek low density hyperplanes that induce a balanced data partition. This tends 
to find projections which display strong multimodal structure, yet prevents convergence to 
hyperplanes that have low density because they partition a few observations, as in the case 
shown in Figure 4(a). Increasing a progressively fine-tunes the location of the MDH. To 
avoid sensitivity to the value of Omax (set to 0.9) the output of the algorithm is the last 
hyperplane that corresponds to a minimiser of the projected density. Figure 5 illustrates 
this approach using the optical recognition of handwritten digits data set from the UCI 
machine learning repository (Lichman, 2013). Figure 5(a) depicts the projected density on 
the initial projection direction, which in this case is the second principal component. As 
shown, the density is unimodal and the clusters are not well separated along this vector. 
Although not shown, if a large value of a is used from the outset, MDP^ will identify a vector 
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(a) MDH separating few observa- (b) Lower density hyperplane be- (c) MDH not a minimiser of the 
tions yond feasible region projected density 

Figure 4: Impact of choice of a on minimum density hyperplane. 


along which the projected density is unimodal and skewed. Figure 5(b) shows that after 
five iterations with a = 10“^ MDP^ has identihed a projection vector with bimodal density. 
In subsequent iterations the two modes become more clearly separated, Figure 5(c), while 
increasing a enables MDP^ to locate an MDH that corresponds to a minimiser of /(v,6), 
as illustrated in Figure 5(d). 

In all experiments we set the bandwidth parameter toh = 0.9dpcjn“^/®, where dpc^ is the 
estimated standard deviation of the data projected onto the first principal component. This 
bandwidth selection rule is recommended when the density being approximated is assumed 
to be multimodal (Silverman, 1986). The parameter r] controls the distance between the 
minimisers of argmin^gj^ /cl(v, b) and arg min^g^j’^^) /(v, 6), while larger values of e increase 
the smoothness of the penalised function /cl- Values of r/ close to zero affect the numerical 
stability of the one-dimensional optimisation problem, due to the term ^ in /cl becoming 
very large. We used r] = 10“^ and e = 1 — 10“® to avoid numerical instability. Beyond these 
numerical problems the values of rj and e do not affect the solutions obtained through MDP^. 

5.1.2 Performance Evaluation 

We compare the performance of MDP^ for clustering with the following methods: 

1. fc-means-b-1- (Arthur and Vassilvitskii, 2007), a version of A:-means that is guaranteed 
to be Cl(log A:)-competitive to the optimal /c-means clustering. 

2. The adaptive linear discriminant analysis guided /c-means (LDA-A:m) (Ding and Li, 
2007). LDA-A:m attempts to discover the most discriminative linear subspace for 
clustering by iteratively using A:-means, to assign labels to observations, and LDA to 
identify the most discriminative subspace. 

3. The principal direction divisive partitioning (PDDP) (Boley, 1998), and the density- 
enhanced PDDP (dePDDP) (Tasoulis et ah, 2010). Both methods project the data 
onto the first principal component. PDDP splits at the mean of the projections, while 
dePDDP splits at the lowest local minimum of the one-dimensional density estimator. 
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(a) Initial projection 


(b) Projection after 5 iterations with q = 10 ^ 




(c) Projection after 25 iterations with a = 10 ^ (d) Final projection with a = Omax 

Figure 5: Evolution of the minimum density hyperplane through consecutive iterations. 

4. The iterative support vector regression algorithm for MMC (Zhang et ah, 2009) using 
the inner product and Gaussian kernel, iSVR-L and iSVR-G respectively. Both are 
initialised with the output of 2-means++. 

5. Normalised cut spectral clustering (SCn) (Ng et ah, 2002) using the Gaussian affin¬ 
ity function, and the automatic bandwidth selection method of Zelnik-Manor and 
Perona (2004). This choice of kernel and bandwidth produced substantially better 
performance than alternative choices considered. For data sets that are too large for 
the eigen decomposition of the Gram matrix to be feasible we employed the Nystrom 
method (Fowlkes et ah, 2004). 

We also considered the density-based clustering algorithm PdfGluster (Menardi and 
Azzalini, 2014), but this algorithm could not be executed on the larger data sets and so 
its performance is not reported in this paper. With the exception of SGn and iSVR-G, 
the methods considered bi-partition the data through a hyperplane in the original feature 
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space. For the 2-means and LDA-2m algorithm the hyperplane separator bisects the line 
segment joining the two centroids. iSVR-L directly seeks the maximum margin hyperplane 
in the original space, while iSVR-G seeks the maximum margin hyperplane in the feature 
space defined by the Gaussian kernel. PDDP and dePDDP use a hyperplane whose normal 
vector is the first principal component. PDDP uses a fixed split point while dePDDP uses 
the hyperplane with minimum density along the fixed projection direction. 

Table 2 reports the performance of the considered methods with respect to the suc¬ 
cess ratio (SR) and the binary V-measure (V-m) on the fourteen data sets. In addition 
Figures 6(a) and 6(b) provide summaries of the overall performance on all data sets using 
boxplots of the raw performance measures as well as the associated regret. The regret of 
an algorithm on a given data set is defined as the difference between the best performance 
attained on this data set and the performance of this algorithm. By comparing against 
the best performing clustering algorithm regret accommodates for differences in difficulty 
between clustering problems, while also making use of the magnitude of performance dif¬ 
ferences between algorithms. The distribution of performance with respect to both SR and 
V-m is negatively skewed for most methods, and as a result the median is higher than the 
mean (indicated with a red dot). 

It is clear from Table 2 that no single method is consistently superior to all others, 
although MDP^ achieves the highest or tied highest performance on seven data sets (more 
than any other method). More importantly MDP^ is among the best performing methods in 
almost all cases. This fact is better captured by the regret distributions in Figure 6(b). Here 
we see that the average, median, and maximum regret of MDP^ is substantially lower than 
any of the competing methods. In addition MDP^ achieves the highest mean and median 
performance with respect to both SR and V-m, while also having much lower variability in 
performance when compared with most other methods. 

Pairwise comparisons between MDP^ and other methods reveal some less obvious facts. 
SCn achieves higher performance than MDP^ in more examples (six) than any other com¬ 
peting method, however it is much less consistent in its performance, obtaining very poor 
performance on five of the data sets. The iSVR maximum margin clustering approach is 
arguably the closest competitor to MDP^. iSVR-L and iSVR-G achieve the second and 
third highest average performance with respect to V-m and SR respectively. The PDDP al¬ 
gorithm is the second best performing method on average with respect to SR, but performs 
poorly with respect to V-m. The density enhanced variant, dePDDP, performs on average 
much worse than MDP^. This approach is similarly motivated by obtaining hyperplanes 
with low density integral, and its low average performance indicates the usefulness of search¬ 
ing for high quality projections as opposed to always using the first principal component. 
Finally, neither of the fc-means variants appears to be competitive with MDP^ in general. 

5.2 Semi-Supervised Classification 

In this section we evaluate MDHs for semi-supervised classification. We compare MDHs 
against three state-of-the-art semi-supervised classification methods: Laplacian Regularised 
Support Vector Machines (LapSVM) (Belkin et ah, 2006), Simple Semi-Supervised Learning 
(SSSL) (Ji et ah, 2012), and Correlated Nystrom Views (XNV) (McWilliams et ah, 2013). 
For all methods the inner product kernel was used to render the resulting classifiers linear, 
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MDP2 

iSVR-L 

iSVR-G 

SCn 

LDA-2m 

2-means-|—I- 

PDDP 

dePDDP 

Data set 

SR 

V-m 

SR 

V-m 

SR 

V-m 

SR 

V-m 

SR 

V-m 

SR 

V-m 

SR 

V-m 

SR 

V-m 

banknote 

0.79 

0.55 

0 

0 

0.35 

0 

0.46 

0.10 

0 

0.01 

0.37 

0.01 

0.40 

0.03 

0 

0.03 

br. cancer 

0.91 

0.79 

0.73 

0.56 

0.73 

0.56 

0 

0.13 

0.87 

0.71 

0.87 

0.72 

0.91 

0.78 

0.90 

0.77 

forest 

0.78 

0.67 

0.90 

0.72 

0.91 

0.74 

0.56 

0.41 

0.76 

0.63 

0.72 

0.58 

0.64 

0.36 

0 

0 

image seg. 

0.89 

0.72 

0.82 

0.59 

0.88 

0.71 

0.92 

0.87 

0.78 

0.58 

0.78 

0.71 

0.87 

0.67 

1 

1 

ionosphere 

0.48 

0.13 

0.47 

0.13 

0.47 

0.13 

0.55 

0.22 

0.47 

0.12 

0.47 

0.12 

0.47 

0.12 

0.42 

0.09 

optidigits 

0.93 

0.85 

0.63 

0.29 

0.82 

0.60 

0 

0 

0.81 

0.62 

0.92 

0.82 

0.68 

0.30 

0 

0 

pendigits 

0.74 

0.39 

0.79 

0.55 

0.88 

0.68 

0.80 

0.68 

0.79 

0.55 

0.78 

0.57 

0.79 

0.54 

0.61 

0.42 

satellite 

0.89 

0.75 

0.73 

0.40 

0.73 

0.40 

0.92 

0.86 

0.73 

0.40 

0.87 

0.81 

0.71 

0.37 

0 

0 

seeds 

0.88 

0.73 

0.71 

0.53 

0.71 

0.53 

0.89 

0.76 

0.96 

0.90 

0.86 

0.70 

0.75 

0.59 

0.73 

0.60 

smartphone 

0.99 

0.97 

0.99 

0.95 

0.99 

0.96 

0.99 

0.94 

0.99 

0.97 

0.99 

0.94 

0.99 

0.95 

0 

0 

Synth 

0.98 

0.94 

0.94 

0.83 

0.94 

0.83 

1 

1 

0.88 

0.76 

1 

1 

0.69 

0.51 

1 

1 

voting 

0.70 

0.43 

0.46 

0.09 

0 

0 

0 

0.05 

0.69 

0.41 

0 

0 

0.70 

0.40 

0.68 

0.38 

wine 

0.77 

0.61 

0.70 

0.52 

0.69 

0.50 

0.67 

0.48 

0.66 

0.48 

0.68 

0.49 

0.65 

0.46 

0.68 

0.49 

yeast 

0.92 

0.76 

0.89 

0.68 

0.91 

0.72 

0.84 

0.61 

0.86 

0.63 

0.91 

0.73 

0.87 

0.65 

0 

0 

Average Improvement 

0.13 

0.18 

0.12 

0.14 

0.22 

0.16 

0.10 

0.11 

0.10 

0.08 

0.11 

0.18 

0.40 

0.32 


Table 2; Performance on the task of binary partitioning. (Ties in best performance were 
resolved by considering more decimal places) 


Success Ratio 



n/ 


V-measuT^ 


n/ 



(a) Raw Performance Measure 


(b) Regret 


Figure 6: Performance and Regret Distributions for all Methods Considered 


and thereby comparable to our method. As the MDH is asymptotically equivalent to a 
linear S^VM we also considered the continuous formulation for the estimation of a S^VM 
proposed by Chapelle and Zien (2005). These results are omitted as this method was not 
competitive on any of the considered data sets. 
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5.2.1 Parameter Settings for MDP^ 

The existence of a few labelled examples enables an informed initialisation of MDP^. We 
consider the hrst and second principal components as well as the weight vector of a linear 
SVM trained on the labelled examples only, and initialise MDP^ with the vector that 
minimises the value of the projection index, (pssc- The penalty parameter 7 is first set 
to 0.1 and with this setting a. is progressively increased in the same way as for clustering. 
After this, a is kept at Omax and 7 is increased to 1 and then 10. Thus the emphasis is 
initially on finding a low density hyperplane with respect to the marginal density p(x). 
As the algorithm progresses the emphasis on correctly classifying the labelled examples 
increases, so as to obtain a hyperplane with low training error within the region of low 
density already determined. 

5.2.2 Performance Evaluation 

To assess the effect on performance of the number of labelled examples, £, we consider a 
range of values. We compare the methods using the subset of data sets used in the previous 
section in which the size of the smallest class exceeds 100. In total eight data sets are used. 
For each value of £, 30 random partitions into labelled and unlabelled data are considered. 
As classes are balanced in the data sets considered, performance is measured only in terms 
of classification error on the unlabelled data. For data sets with more than two classes all 
pairwise combinations of classes are considered and aggregate performance is reported. 

Figure 7 provides plots of the median and interquartile range of the classification error for 
values of i between 5 and 100 for the four data sets with two classes. Overall MDP^ appears 
to be most competitive when the number of labelled examples is small. In addition, MDP^ 
is comparable with the best performing method in almost every case. The only exception 
is the ionosphere data set where LapSVM outperforms MDP^ for all values of i. Figure 8 
provides plots of the median and interquartile range of the aggregate classification error on 
data sets containing more than two classes. As these data sets are larger we consider up to 
300 labelled examples. Note that the interquartile range for XNV is not depicted for the 
satellite data set. The variability of performance of XNV on this data set was so high that 
including the interquartile range would obscure all other information in the figure. MDP^ 
exhibits the best performance overall, and obtains the lowest median classification error, or 
tied lowest, for all data sets and values of i. 

5.3 Summary of Experimental Results 

We evaluated the performance of the MDP^ formulation for finding MDHs for both clus¬ 
tering and semi-supervised classification, on a large collection of benchmark data sets, and 
in comparison with state-of-the-art methods for both problems. 

For clustering, we found that no single method was consistently superior to all others. 
This is a result of the vastly differing nature of the data sets in terms of size, dimensionality, 
number and shape of clusters, etc. MDP^ achieved the best performance on more data 
sets than any of the competing methods, and importantly was competitive with the best 
performing method in almost every data set considered. All other methods performed poorly 
in at least as many examples. Boxplots of both the raw performance and performance regret, 
which measures the difference between each method and the best performing method on each 
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(c) breast cancer (d) ionosphere 

MDP^ median (— a —), LapSVM median (— o— ), SSSL median (— □— ), XNV 
median ( : ), with corresponding interquartile ranges given by shaded regions. 


Figure 7: Classification error for different number of labelled examples for data sets with 
two clusters. 
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100 150 200 250 

Number of Labelled Examples 


(c) satellite (d) image segmentation 

MDP^ median (— a —), LapSVM median ( — o — ), SSSL median (—□—), XNV median 
( -+- ), with corresponding interquartile ranges given by shaded regions. 


Figure 8: Classification error for different numbers of labelled examples over all pairwise 
combinations of classes. 
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data set, allowed us to summarise the comparative performance of the different methods 
across data sets. The mean and median raw performance of MDP^ is substantially higher 
than the next best performing method, and the regret is also substantially lower. 

In the case of semi-supervised classification it was apparent that MDP^ is extremely 
competitive when the number of labelled examples is (very) small, but that in some cases 
its performance does not improve as much as that of the other methods considered, when 
the labelled examples become more abundant. Our experiments suggest that overall MDP^ 
is very competitive with the state-of-the-art for semi-supervised classification problems. 

6. Conclusions 

We proposed a new hyperplane classifier for clustering and semi-supervised classification. 
The proposed approach is motivated by determining low density linear separators of the 
high-density clusters within a data set. This is achieved by minimising the integral of the 
empirical density along the hyperplane, which is computed through kernel density esti¬ 
mation. To the best of our knowledge this is the first direct implementation of the low 
density separation assumption that underlies high-density clustering and numerous influen¬ 
tial semi-supervised classification methods. We show that the minimum density hyperplane 
is asymptotically connected with maximum margin hyperplane, thereby establishing an 
important link between the proposed approach, maximum margin clustering, and semi- 
supervised support vector machines. 

The proposed formulation allows us to evaluate the integral of the density on a hyper¬ 
plane by projecting the data onto the vector normal to the hyperplane, and estimating a 
univariate kernel density estimator. This enables us to apply our method effectively and 
efficiently on data sets of much higher dimensionality than is generally possible for den¬ 
sity based clustering methods. To mitigate the problem of convergence to locally optimal 
solutions we proposed a projection pursuit formulation. 

We evaluated the minimum density hyperplane approach on a large collection of bench¬ 
mark data sets. The experimental results obtained indicate that the method is competitive 
with state-of-the-art methods for clustering and semi-supervised classification. Importantly 
the performance of the proposed approach displays low variability across a variety of data 
sets, and is robust to differences in data size, dimensionality, and number of clusters. In 
the context of semi-supervised classification, the proposed approach shows especially good 
performance when the number of labelled data is small. 
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Figure 9; Two dimensional illustration of Lemma 9 


Memorial Trust. The underlying code and data are openly available from Lancaster Uni¬ 
versity data repository at http://dx.doi.org/10.17635/lancaster/researchdata/97. 

Appendix A. Proof of Theorem 5 

Before proving Theorem 5 we require the following two technical lemmata which establish 
some algebraic properties of the maximum margin hyperplane. The following lemma shows 
that any hyperplane orthogonal to the maximum margin hyperplane results in a different 
partition of the support points of the maximum margin hyperplane. The proof relies on 
the fact that if this statement does not hold then a hyperplane with larger margin exists 
which is a contradiction. Figure 9 provides an illustration of why this result holds, (a) Any 
hyperplane orthogonal to MMH generates a different partition of the support points of 
MMH, e.g., the point highlighted in red in (b) is grouped with the lower three by the 
dotted line but with the upper two by the solid line, the MMH. If an orthogonal hyperplane 
can generate the same partition (c), then a larger margin hyperplane than the proposed 
MMH exists (d). 

Lemma 9 Suppose there is a unique hyperplane in F with maximum margin, which can 
be parameterised by (v™',6™) G 5'^“^ x M. Let M = margin //(v™, 6™), C~^ = {x G 
A I V™ • X - 6”^ = M} and C- = {x G A | 6”* - • x = M}. Then, Vw G Null(v™), 

c G M either min{w • x — c | x G C~^} ^ 0, or max{w • x — c | x G C~} ^ 0. 

Proof 
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Suppose the result does not hold, then 3(w,c) with ||w|| = l,w • v™ = 0 and min{w • 
X —c|x G C^} > 0 and max{w-x —c|x G C*”} < 0. Let m = min{|w-x —c| | x G C'+UC'“}. 
Dehne A = ^ < 1. Define u = -^p^=^(Aw + (1 - A)v”^) and d = By 

construction ||u|| = 1. For any x_|_ G we have, 


u • x+ 


_ A(w • x+ - c) + (1 - A)(v”^ • x+ - b^) 
~ VA2 + (1 - A)2 

^ Am + (1 — A)M 

" ^a2 + (1 - aF 

m? + 2M^ — Mm 

+ (2M — mY 

> M. 


Similarly one can show that d — u- x_ > M for any x_ G C , meaning that (u, d) achieves 
a larger margin on and C~ than (v*”, b"^), a contradiction. 


The next lemma uses the above result to provide an upper bound on the distance between 
pairs of support points projected onto any vector, in terms of the angle between that vector 
and v™'. 

Lemma 10 Suppose there is a unique hyperplane in F with maximum margin, which can 
be parameterised by G 5'^“^ x M. Define M = margin //(v*”, ft*”), C~^ = {x G 

T’|v™' • X — ft*” = M], and C~ = {x G X\b'^ — v™' • x = M}. There is no vector w G for 

which w • x_|_ — w • x_ > 2Mv™ • w for all pairs x+ G C^, x_ G C~. 

Proof 

Suppose such a vector exists. Define w' = w — (v™' • w)v”^. By construction w' G 
Null(v™'). For any pair X 4 . G C'^,x_ G C~ we have 

w • — w • rc- = w • — (v • wjv • — w • x_ + (v • wjv • x_ 

> ^r^ ■ w( 2 M - • x+ + 6 ™ - 6 ”^ + • x_) 

= 0 . 

Dehne c := |(min{w' •X-|_|x_|_ G +max{w' • x_ |x_ G C”}). Then min{w' •x_)- — c|x_|_ G 
C~^} > 0 and max{w' • x_ — c|x_ G C~} < 0, a contradiction. 


We are now in a position to provide the main proof of this appendix. The theorem 
states that if the maximum margin hyperplane is unique, and can be parameterised by 
(v™,6™) G X M, then 

liin min{||K,6;;) - (v-,6™)||, ||K,6*J + (v-,6-)||} = 0, 
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where is any collection of minimum density hyperplanes indexed by their 

bandwidth h > 0 . 


Proof of Theorem 5 

Define M = margin 6 ™), C~^ = {x G T | v™' • x — 6 ™ = M} and C~ = {x G 

X • X = M}. Let B = max{||x|| | x G X}. Take any e > 0 and set 0 < 5 to satisfy 

||(1 + + 2B5 ^/‘^+ (5^ = e^. Now, suppose (w,c) G x M satishes, 

w • x+ — c > M — 5, Vx+ G C'"’" and c — w • x_ > M — 6, Vx_ G C~. 

By Lemma 10 we know that 3x_|_ G C^,x- G C~ s.t. w • x_|_ — w • x_ ^ 2Mv™ • w. Thus 


V 


m 


w ^ 


w • x+ — w • x_ 


w • x+ — c + c — w • x_ 

“ 2M 

^ 2M -25 5 

^ 2M ~ ~ M' 

Thus IIV™ — w|p < Now, for each x+ G ,'v'^ • x+ — 6 
C~,b — V™ • x_ = M. Thus for any such x_|_,x_ we have. 


M and for each x_ G 


M — (5 + w • x_ < c < w • x_|_ — M + 5, 
bm — v™ -x-—(5 + w-x_ <c<w-x+ — V™ • x+ + 6™ + 5, 
6™ - (5 - (v"^ - w) • x_ < c < 6™ + (5 + (w - V™) • x+, 
b^ -5- B\\^^ - w|| < c < 6™ + (5 + B||w - v™||, 
|c-6™| < |5 + 5||w-v”^||| . 

We can now bound the distance between (w,c) and (v™', 6 ™'), 
||(v-, 6 -)-(w,c)f = 

< 

< 


|v”^-wf+ | 6 ™-c 

|,,m I d2 


- wr(l + B^) + 255||v™ - w|| + 5^ 

,2 


We have shown that for any hyperplane H (w, c) that achieves a margin larger than M — 5 
on the support points of the maximum margin hyperplane, x G C~^ U C~ , the distance 
between (w, c) and (v™',6™') is less than e. Equivalently, any hyperplane H{'w,c) such that 
II (w, c) — (v™, 6™') II > e has a margin less than M — 5, as min ||w • x — c| | x G C~^ U C- } < 
M — 5. By symmetry, the same holds for any (w, c) within distance e of (—v™', —6™). 

By Lemma 4 3hi > 0 such that for all h G (0,/ii), the minimum density hyperplane 
for h, H(v^,b^), induces the same partition of X as the maximum margin hyperplane, 
iL(v”^, h^). By Lemma 3 3 /i 2 > 0 such that for all h G (0, /i 2 ), margin Lf(v^, b\) > M — 5. 
Therefore for h G (0, min{/ii,/i 2 }), min{||(v^, 6 ^) — (v™-, 6 ™')||, ||(v^, 6 ]!^) + (v™, 6 ™')||} < e. 
Since e > 0 was arbitrarily chosen, this gives the result. 
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